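The ROCm (Radeon Open Compute) ecosystem is an open-source, modular, layered software stack that connects AMD GPU hardware to high-performance computing workloads. It is not a single driver but a workflow reality: a series of deployment stages that keep the environment stable and reproducible.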
1. Modular Stack Architecture
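ROCm components are decoupled from one another to support fine-grained scaling. The stack runs from the AMDGPU kernel driver through ROCT (the thunk wrapper layer) and ROCR (the runtime) up to the HIP API and the math libraries. This architecture calls for a systematic onboarding workflow.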
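To make the layering concrete, the sketch below models the stack as an ordered list, bottom to top, and reports which layers sit beneath a given component. The layer names come from the description above; the `Layer` type and the `layers_below` helper are hypothetical names used only for illustration and are not part of any ROCm API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    name: str
    role: str


# Ordered from the bottom (closest to the hardware) to the top.
ROCM_STACK = [
    Layer("AMDGPU kernel driver", "kernel-space device driver"),
    Layer("ROCT", "thin user-space wrapper over the kernel interface"),
    Layer("ROCR", "runtime"),
    Layer("HIP API and math libraries", "programming interface and libraries"),
]


def layers_below(name: str) -> list[str]:
    """Return the names of every layer the named layer sits on top of."""
    names = [layer.name for layer in ROCM_STACK]
    return names[: names.index(name)]  # raises ValueError for unknown names


if __name__ == "__main__":
    for layer in ROCM_STACK:
        below = layers_below(layer.name) or ["(hardware only)"]
        print(f"{layer.name} <- {', '.join(below)}")
```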
2. The Deployment Lifecycle
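The platform reality emphasizes a strict dependency chain: align the kernel version with the support matrix, initialize the GPG-signed repository, resolve dependencies through the native package manager, and configure PATH and the render group to expose the hardware interface to HIP.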
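One way to picture this dependency chain is as an ordered list of checks that stops at the first failure, since each later stage only makes sense once the earlier ones hold. The following is a minimal sketch under stated assumptions, not an official tool: the supported kernel series and the repository file path are hypothetical placeholders, and `run_chain` and `kernel_series` are illustrative names.

```python
import os
import platform
import shutil
from typing import Callable

# HYPOTHETICAL values for illustration only: read the real ones from AMD's
# published support matrix for the ROCm release being installed.
ASSUMED_SUPPORTED_KERNEL_SERIES = {"5.15", "6.8"}
# ASSUMED repository file location (Debian/Ubuntu style); adjust per distribution.
ASSUMED_REPO_FILE = "/etc/apt/sources.list.d/rocm.list"


def kernel_series(release: str) -> str:
    """Reduce a release string such as '6.8.0-31-generic' to 'major.minor'."""
    major, minor = release.split(".")[:2]
    return f"{major}.{minor}"


def run_chain(stages: list[tuple[str, Callable[[], bool]]]) -> list[str]:
    """Run checks in order and stop at the first failure.

    Later stages depend on earlier ones, so their results would be
    meaningless once an earlier stage has failed.
    """
    report = []
    for name, check in stages:
        if not check():
            report.append(f"FAILED: {name}")
            break
        report.append(f"ok: {name}")
    return report


if __name__ == "__main__":
    stages = [
        ("kernel series is in the support matrix",
         lambda: kernel_series(platform.release()) in ASSUMED_SUPPORTED_KERNEL_SERIES),
        ("ROCm repository is configured",
         lambda: os.path.exists(ASSUMED_REPO_FILE)),
        ("ROCm packages are installed",
         lambda: os.path.isdir("/opt/rocm")),
        ("hipcc is on PATH",
         lambda: shutil.which("hipcc") is not None),
        ("/dev/kfd is accessible to this user",
         lambda: os.access("/dev/kfd", os.R_OK | os.W_OK)),
    ]
    print("\n".join(run_chain(stages)))
```

The repository check only confirms that a source file exists; it does not verify the GPG signature, which the package manager does when it refreshes its metadata.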
QUESTION 1
Which component acts as the 'authoritative gatekeeper' in the ROCm deployment workflow?
The HIP Runtime API
The Support Matrix
The GPG Repository Key
The LLVM Compiler Backend
✅ Correct!
Correct. The Support Matrix defines the compatible intersection of hardware, OS distributions, and kernel versions.
❌ Incorrect
The Support Matrix must be verified first to ensure hardware/software compatibility before any API or keys are used.
QUESTION 2
What is the primary purpose of 'Repository Bootstrapping'?
To compile the kernel driver from source.
To establish a trusted link to AMD servers via GPG keys and source mapping.
To allocate VRAM for the first time.
To convert CUDA code to HIP code automatically.
✅ Correct!
Yes. Bootstrapping ensures the system can securely pull authentic ROCm binaries and headers.
❌ Incorrect
Bootstrapping is about metadata and trust (keys/sources), not compilation or memory allocation.
QUESTION 3
Why does the shell usually report 'command not found' for hipcc immediately after installation?
The installation failed silently.
The user lacks permissions to execute the file.
ROCm binaries reside in non-standard, versioned directories (e.g., /opt/rocm-<version>/bin, typically linked from /opt/rocm/bin).
The kernel fusion driver (KFD) is not loaded.
✅ Correct!
Correct. ROCm tools are installed in versioned directories to allow co-existence; the PATH must be manually updated.
❌ Incorrect
The issue is visibility. The binaries exist but are not in the system's standard executable path.
QUESTION 4
Which system group is required for a user to access GPU device files like /dev/kfd?
admin
render (or video)
amd-drivers
compute-users
✅ Correct!
Correct. The Linux security model restricts direct hardware interaction to members of the 'video' and 'render' groups.
❌ Incorrect
Linux uses the standard 'render' or 'video' groups for GPU device access.
QUESTION 5
What does the rocminfo utility verify?
Hardware temperature and clock speeds.
The successful handshake between user-space libraries and the kernel driver.
Code syntax errors in HIP applications.
Internet connectivity to AMD's update servers.
✅ Correct!
Yes. rocminfo checks if the HSA (Heterogeneous System Architecture) agents are reachable.
❌ Incorrect
Temperature is checked via rocm-smi; rocminfo is for stack health and topology.
Case Study: Scaling LLM Training on a Fresh Cluster
Dependency Resolution and Permissions
A DevOps engineer is setting up a new multi-GPU server for LLM training. They have installed the `amdgpu-dkms` package, but the training application fails with `hsa_init() failed`. The engineer notes that the user is not in any special groups and the environment variables are default.
Q
Based on the ROCm Platform Reality, which missing step is likely causing the 'hsa_init() failed' error?
Solution:
The user is likely missing membership in the 'render' or 'video' groups. Even if the driver is correctly installed, the application cannot open the `/dev/kfd` device file without these group permissions.
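A quick way to confirm this diagnosis is to compare the groups attached to the running process with the ones the device files expect. The sketch below assumes a Linux host; `missing_groups` and `current_group_names` are hypothetical helper names, and only standard-library calls (`os.getgroups`, `grp.getgrgid`, `os.access`) are used.

```python
import grp
import os

REQUIRED_GROUPS = ("render", "video")


def missing_groups(current, required=REQUIRED_GROUPS):
    """Return the required groups that are absent from the current group names."""
    have = set(current)
    return [group for group in required if group not in have]


def current_group_names():
    """Names of the groups attached to this process (its login session)."""
    names = []
    for gid in set(os.getgroups()) | {os.getegid()}:
        try:
            names.append(grp.getgrgid(gid).gr_name)
        except KeyError:
            names.append(str(gid))  # gid with no entry in the group database
    return names


if __name__ == "__main__":
    missing = missing_groups(current_group_names())
    print("missing groups:", ", ".join(missing) or "none")
    # /dev/kfd is the device file named in the case study.
    print("/dev/kfd accessible:", os.access("/dev/kfd", os.R_OK | os.W_OK))
```

Because the check reads the groups of the running process rather than the group database, it keeps reporting a group as missing until the user starts a new login session after being added to it.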
Q
Which command should the engineer run to grant the necessary hardware access to the current user?
Solution:
Run `sudo usermod -aG render,video $USER`, then log out and back in so the new session picks up the updated group membership.
Q
If the application still cannot find the HIP compiler, what environmental change is required?
Solution:
The engineer must append the ROCm bin directory to the PATH variable:
export PATH=$PATH:/opt/rocm/bin
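The same change can be checked from a script before launching a training job. This is a minimal sketch, assuming the default `/opt/rocm/bin` location used above; `with_rocm_bin` is a hypothetical helper name, while `shutil.which` is the standard-library lookup that mirrors how the shell resolves commands.

```python
import os
import shutil


def with_rocm_bin(path_value, rocm_bin="/opt/rocm/bin"):
    """Append rocm_bin to a PATH string unless it is already listed."""
    entries = [entry for entry in path_value.split(os.pathsep) if entry]
    if rocm_bin not in entries:
        entries.append(rocm_bin)
    return os.pathsep.join(entries)


if __name__ == "__main__":
    os.environ["PATH"] = with_rocm_bin(os.environ.get("PATH", ""))
    # shutil.which returns None when the executable cannot be found on PATH.
    print("hipcc:", shutil.which("hipcc") or "not found")
```

Setting `os.environ` affects only this process and its children; to make the change permanent, the export line belongs in the user's shell profile.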